Data Cleaning and Manipulation with R

R Introduction Workshop

March 15, 2018

Hi there!

Find the materials for this presentation here:
github.com/karaesmen/WIA2018_data_cleaning

You can find out more about R-Ladies:
Meetup.com: meetup.com/RLadies-Columbus/
Twitter: @RLadiesColumbus

R for Data Science - Data Science

What is data science? A buzzword with a blurry definition…

R for Data Science - Big Data

What is big data?
Another buzzword with a blurry definition?

R for Data Science - Big Data

Is your “big data” actually big data?

R for Data Science - Why R is so awesome

R offers great tools for the entire Data Science process

R for Data Science - Why R is so awesome

And other cool things like animated plots:

R for Data Science - Why R is so awesome

Or interactive plots:

library(plotly)   # also attaches ggplot2, which plotly depends on

p <- iris %>% 
    ggplot(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
    geom_point() +
    geom_smooth()

ggplotly(p)   # convert the ggplot into an interactive plotly widget

R for Data Science - Why R is so awesome

R community #rstats on Twitter

R for Data Science - The Book and tidyverse

Tidying and manipulating

In most cases, data doesn’t come clean (pun intended)… it requires cleaning before any modelling or visualization.

“Happy families are all alike; 
every unhappy family is unhappy in its own way.” 
–– Leo Tolstoy
“Tidy datasets are all alike, 
but every messy dataset is messy in its own way.” 
–– Hadley Wickham
Luckily, there is a nice set of packages offering help with messy datasets.

What is tidy data?

Is this data tidy?

         Pregnant   Not.Pregnant
Male            0              5
Female          1              4

Answer: This data can be tidier

         Pregnant   Not.Pregnant
Male            0              5
Female          1              4

A tidier version

sex      pregnant   n
male     no         5
male     yes        0
female   no         4
female   yes        1
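The tidier version can be produced with tidyr's gather(); a minimal sketch, rebuilding the wide table above as a tibble and recoding the gathered key into yes/no with a simple ifelse():

```r
library(tidyverse)

# Rebuild the wide table above as a tibble
preg <- tribble(
  ~sex,     ~Pregnant, ~Not.Pregnant,
  "male",   0,         5,
  "female", 1,         4
)

# Gather the two count columns into key/value pairs,
# then recode the key into yes/no
preg_tidy <- preg %>%
  gather(key = pregnant, value = n, -sex) %>%
  mutate(pregnant = ifelse(pregnant == "Pregnant", "yes", "no"))
preg_tidy
```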

Tidying messy datasets

Real datasets usually come with the following problems that need tidying:

browseVignettes(package = "dplyr")
browseVignettes(package = "tidyr")

Column headers are values, not variable names - pew example

library(tidyverse)
pew <- read_csv("data/pew.csv")
head(pew)
religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k Don’t know/refused
Agnostic 27 34 60 81 76 137 122 109 84 96
Atheist 12 27 37 52 35 70 73 59 74 76
Buddhist 27 21 30 34 33 58 62 39 53 54
Catholic 418 617 732 670 638 1116 949 792 633 1489
Don’t know/refused 15 14 15 11 10 35 21 17 18 116
Evangelical Prot 575 869 1064 982 881 1486 949 723 414 1529
gather(data=pew, key=income, value=n, -religion)

Using the pipe %>%, the same code looks like this:

pew %>%
    gather(key=income, value=n, -religion) %>%
    head
religion income n
Agnostic <$10k 27
Atheist <$10k 12
Buddhist <$10k 27
Catholic <$10k 418
Don’t know/refused <$10k 15
Evangelical Prot <$10k 575

This form is tidy because each column represents a variable and each row represents an observation, which is in this case a demographic unit corresponding to a combination of religion and income.
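If pew.csv is not at hand, the same reshape can be tried on a small inline slice of the table (counts copied from the head() output above, restricted to two income columns):

```r
library(tidyverse)

# A two-column slice of the pew table
pew_mini <- tribble(
  ~religion,  ~`<$10k`, ~`$10-20k`,
  "Agnostic", 27,       34,
  "Atheist",  12,       27
)

# Every column except religion is an income bracket, i.e. a value
pew_tidy <- pew_mini %>%
  gather(key = income, value = n, -religion)
pew_tidy
```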

Column headers are values, not variable names - billboard

This format is also used to record regularly spaced observations over time.

billboard <- read_csv("data/billboard.csv")
billboard[1:3, 1:10]
year artist track time date.entered wk1 wk2 wk3 wk4 wk5
2000 2 Pac Baby Don’t Cry (Keep… 04:22:00 2000-02-26 87 82 72 77 87
2000 2Ge+her The Hardest Part Of … 03:15:00 2000-09-02 91 87 92 NA NA
2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 81 70 68 67 66

Again, to tidy this dataset, we first gather together all the wk columns. The column names give the week and the values are the ranks:

billboard2 <- billboard %>% 
  gather(week, rank, wk1:wk76, na.rm = TRUE)
billboard2 %>% head
year artist track time date.entered week rank
2000 2 Pac Baby Don’t Cry (Keep… 04:22:00 2000-02-26 wk1 87
2000 2Ge+her The Hardest Part Of … 03:15:00 2000-09-02 wk1 91
2000 3 Doors Down Kryptonite 03:53:00 2000-04-08 wk1 81
2000 3 Doors Down Loser 04:24:00 2000-10-21 wk1 76
2000 504 Boyz Wobble Wobble 03:35:00 2000-04-15 wk1 57
2000 98^0 Give Me Just One Nig… 03:24:00 2000-08-19 wk1 51

Here we use na.rm = TRUE to drop missing values from the gathered columns. In this data, missing values represent weeks when the song wasn’t on the charts, so they can safely be dropped.

In this case it’s also nice to do a little cleaning, converting the week variable to a number, and figuring out the date corresponding to each week on the charts:

billboard3 <- billboard2 %>%
  mutate(
    week = parse_number(week),
    date = as.Date(date.entered) + 7 * (week - 1)) %>%
  select(-date.entered)
billboard3 %>% head
year artist track time week rank date
2000 2 Pac Baby Don’t Cry (Keep… 04:22:00 1 87 2000-02-26
2000 2Ge+her The Hardest Part Of … 03:15:00 1 91 2000-09-02
2000 3 Doors Down Kryptonite 03:53:00 1 81 2000-04-08
2000 3 Doors Down Loser 04:24:00 1 76 2000-10-21
2000 504 Boyz Wobble Wobble 03:35:00 1 57 2000-04-15
2000 98^0 Give Me Just One Nig… 03:24:00 1 51 2000-08-19

Finally, it’s always a good idea to sort the data. We could do it by artist, track and week:

billboard3 %>% arrange(artist, track, week)

Or by date and rank:

billboard3 %>% arrange(date, rank) %>% head
year artist track time week rank date
2000 Lonestar Amazed 04:25:00 1 81 1999-06-05
2000 Lonestar Amazed 04:25:00 2 54 1999-06-12
2000 Lonestar Amazed 04:25:00 3 44 1999-06-19
2000 Lonestar Amazed 04:25:00 4 39 1999-06-26
2000 Lonestar Amazed 04:25:00 5 38 1999-07-03
2000 Lonestar Amazed 04:25:00 6 33 1999-07-10

Multiple variables stored in one column

After gathering columns, the key column is sometimes a combination of multiple underlying variable names.
- Example dataset: tb (tuberculosis)
- From the World Health Organisation; it records the counts of confirmed tuberculosis cases by country, year, and demographic group.
- The demographic groups are broken down by sex (m, f) and age (0-14, 15-24, 25-34, 35-44, 45-54, 55-64, unknown).

tb <- read_csv("data/tb.csv")
tb[1:3, 1:10]
iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564
AD 1989 NA NA NA NA NA NA NA NA
AD 1990 NA NA NA NA NA NA NA NA
AD 1991 NA NA NA NA NA NA NA NA

First we gather up the non-variable columns:

tb2 <- tb %>% 
  gather(key = demo, value = n, -iso2, -year, na.rm = TRUE)
tb2 %>% head
iso2 year demo n
AD 2005 m04 0
AD 2006 m04 0
AD 2008 m04 0
AE 2006 m04 0
AE 2007 m04 0
AE 2008 m04 0

Column headers in this format are often separated by a non-alphanumeric character (e.g. ., -, _, :), or have a fixed-width format, like in this dataset. separate() makes it easy to split a compound variable into individual variables. You can either pass it a regular expression to split on (the default is to split on non-alphanumeric characters), or a vector of character positions. In this case we want to split after the first character:

tb3 <- tb2 %>% 
  separate(demo, into = c("sex", "age"), sep = 1)
tb3 %>% head
iso2 year sex age n
AD 2005 m 04 0
AD 2006 m 04 0
AD 2008 m 04 0
AE 2006 m 04 0
AE 2007 m 04 0
AE 2008 m 04 0

Storing the values in this form resolves a problem in the original data. We want to compare rates, not counts, which means we need to know the population.
In the original format, there is no easy way to add a population variable.
It has to be stored in a separate table, which makes it hard to correctly match populations to counts.
In tidy form, adding variables for population and rate is easy because they’re just additional columns.
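As a sketch of that last point: once the data is tidy, a population table joins on naturally, and a rate is just one more mutate(). The population figures below are invented purely for illustration:

```r
library(tidyverse)

# Tidy counts, in the shape produced by gather() + separate() above
tb_tidy <- tribble(
  ~iso2, ~year, ~sex, ~age,  ~n,
  "AD",  2005,  "m",  "04",  0,
  "AD",  2005,  "f",  "04",  0
)

# Hypothetical population table (figures invented for illustration)
pop <- tribble(
  ~iso2, ~year, ~sex, ~age,  ~population,
  "AD",  2005,  "m",  "04",  3000,
  "AD",  2005,  "f",  "04",  2900
)

# Join on the shared keys, then compute a rate as one more column
tb_rate <- tb_tidy %>%
  left_join(pop, by = c("iso2", "year", "sex", "age")) %>%
  mutate(rate = n / population)
tb_rate
```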

Variables are stored in both rows and columns

The most complicated form of messy data occurs when variables are stored in both rows and columns.

weather <- read_csv("data/weather.csv")
weather[1:3, 1:15]
id year month element d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11
MX17004 2010 1 tmax NA NA NA NA NA NA NA NA NA NA NA
MX17004 2010 1 tmin NA NA NA NA NA NA NA NA NA NA NA
MX17004 2010 2 tmax NA 27.3 24.1 NA NA NA NA NA NA NA 29.7

To tidy this dataset we first gather the day columns:

weather2 <- weather %>%
  gather(key = day, value = value, d1:d31, na.rm = TRUE)
weather2 %>% head
id year month element day value
MX17004 2010 12 tmax d1 29.9
MX17004 2010 12 tmin d1 13.8
MX17004 2010 2 tmax d2 27.3
MX17004 2010 2 tmin d2 14.4
MX17004 2010 11 tmax d2 31.3
MX17004 2010 11 tmin d2 16.3

Here we remove the missing values with na.rm = TRUE.
This is OK because we know how many days are in each month and can easily reconstruct the explicit missing values.
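Should the explicit missing values be needed again later, tidyr::complete() can reconstruct them; a minimal sketch on a made-up two-observation series:

```r
library(tidyverse)

# A series with day 2 implicitly missing
obs <- tibble(day = c(1, 3), value = c(10.0, 12.0))

# complete() reinstates day 2 as an explicit NA row
obs_full <- obs %>% complete(day = 1:3)
obs_full
```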

We’ll also do a little cleaning:

weather3 <- weather2 %>% 
  mutate(day = parse_number(day)) %>%
  select(id, year, month, day, element, value) %>%
  arrange(id, year, month, day)
weather3 %>% head
id year month day element value
MX17004 2010 1 30 tmax 27.8
MX17004 2010 1 30 tmin 14.5
MX17004 2010 2 2 tmax 27.3
MX17004 2010 2 2 tmin 14.4
MX17004 2010 2 3 tmax 24.1
MX17004 2010 2 3 tmin 14.4
weather4 <- weather3 %>% 
    spread(key = element, value = value)
weather4 %>% head
id year month day tmax tmin
MX17004 2010 1 30 27.8 14.5
MX17004 2010 2 2 27.3 14.4
MX17004 2010 2 3 24.1 14.4
MX17004 2010 2 11 29.7 13.4
MX17004 2010 2 23 29.9 10.7
MX17004 2010 3 5 32.1 14.2

This form is tidy: there’s one variable in each column, and each row represents one day.
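The tmax/tmin pivot can be seen in miniature on an inline table (readings copied from the output above):

```r
library(tidyverse)

# Long form: one row per (day, element) pair
long <- tribble(
  ~day, ~element, ~value,
  30,   "tmax",   27.8,
  30,   "tmin",   14.5,
  2,    "tmax",   27.3,
  2,    "tmin",   14.4
)

# spread() turns the element values into tmax/tmin columns
wide <- long %>% spread(key = element, value = value)
wide
```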

Multiple types in one table - billboard

The billboard dataset actually contains observations on two types of observational units: the song and its rank in each week. This manifests itself through the duplication of facts about the song: artist, year and time are repeated many times.

billboard3 %>%
    head
year artist track time week rank date
2000 2 Pac Baby Don’t Cry (Keep… 04:22:00 1 87 2000-02-26
2000 2Ge+her The Hardest Part Of … 03:15:00 1 91 2000-09-02
2000 3 Doors Down Kryptonite 03:53:00 1 81 2000-04-08
2000 3 Doors Down Loser 04:24:00 1 76 2000-10-21
2000 504 Boyz Wobble Wobble 03:35:00 1 57 2000-04-15
2000 98^0 Give Me Just One Nig… 03:24:00 1 51 2000-08-19

This dataset needs to be broken down into two pieces:
- a song dataset which stores artist, song name and time
- and a ranking dataset which gives the rank of the song in each week.

We first extract a song dataset:

song <- billboard3 %>% 
  select(artist, track, year, time) %>%
  unique() %>%
  mutate(song_id = row_number())
song %>% head
artist track year time song_id
2 Pac Baby Don’t Cry (Keep… 2000 04:22:00 1
2Ge+her The Hardest Part Of … 2000 03:15:00 2
3 Doors Down Kryptonite 2000 03:53:00 3
3 Doors Down Loser 2000 04:24:00 4
504 Boyz Wobble Wobble 2000 03:35:00 5
98^0 Give Me Just One Nig… 2000 03:24:00 6

Then use that to make a rank dataset by replacing repeated song facts with a pointer to song details (a unique song id):

rank <- billboard3 %>%
  left_join(song, c("artist", "track", "year", "time")) %>%
  select(song_id, date, week, rank) %>%
  arrange(song_id, date)
rank %>% head
song_id date week rank
1 2000-02-26 1 87
1 2000-03-04 2 82
1 2000-03-11 3 72
1 2000-03-18 4 77
1 2000-03-25 5 87
1 2000-04-01 6 94
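The split can be seen in miniature on an inline table (artist and track names replaced by generic placeholders):

```r
library(tidyverse)

# Weekly rank rows with repeated song facts (placeholder names)
chart <- tribble(
  ~artist,    ~track,    ~week, ~rank,
  "artist A", "track 1", 1,     87,
  "artist A", "track 1", 2,     82,
  "artist B", "track 2", 1,     51
)

# One row per song, with a unique id
song_mini <- chart %>%
  select(artist, track) %>%
  unique() %>%
  mutate(song_id = row_number())

# Weekly ranks keep only the pointer to the song
rank_mini <- chart %>%
  left_join(song_mini, by = c("artist", "track")) %>%
  select(song_id, week, rank)
rank_mini
```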

One type in multiple tables

  1. Read the files into a list of tables.

  2. For each table, add a new column that records the original file name (the file name is often the value of an important variable).

  3. Combine all tables into a single table.

(paths <- dir("data", pattern = "iris", full.names = TRUE))  # pattern is a regex, not a glob
#> [1] "data/iris1.txt" "data/iris2.txt" "data/iris3.txt"
map_df(paths, read_tsv) %>% head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
basename(paths)
#> [1] "iris1.txt" "iris2.txt" "iris3.txt"
read_files <- function(x) {
    # read one file and record its name in a `file` column
    read_tsv(x) %>%
        mutate(file = basename(x))
}
map_df(paths, read_files) %>% head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species file
5.1 3.5 1.4 0.2 setosa iris1.txt
4.9 3.0 1.4 0.2 setosa iris1.txt
4.7 3.2 1.3 0.2 setosa iris1.txt
4.6 3.1 1.5 0.2 setosa iris1.txt
5.0 3.6 1.4 0.2 setosa iris1.txt
5.4 3.9 1.7 0.4 setosa iris1.txt
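The same pattern works without files on disk; a sketch that uses a named list of tables and map_df()'s .id argument to record each table's name:

```r
library(tidyverse)

# Split iris into three tables (one per species) to simulate three files
tables <- split(iris, iris$Species)

# Recombine, recording each table's name in a `source` column
combined <- map_df(tables, as_tibble, .id = "source")
head(combined)
```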